GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty#48718
GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty#48718wgtmac merged 5 commits intoapache:mainfrom
Conversation
|
|
|
@wgtmac Hi would you mind to also review this? Thanks! |
|
I think we can clear |
|
Clearing after |
|
|
…ch is empty When writing CSV, if the first record batch was empty, the header would be written twice. This happened because: 1. Header is written to data_buffer_ and flushed during initialization 2. TranslateMinimalBatch returns early for empty batches without modifying data_buffer_ 3. The loop then writes data_buffer_ which still contains the header The fix clears the buffer (resize to 0) when encountering an empty batch, so the subsequent write outputs nothing. Added C++ and Python tests for empty batches at start and in middle of tables. Claude-Generated-By: Claude Code (cli/claude-opus-4-5=1%) Claude-Steers: 2 Claude-Permission-Prompts: 2 Claude-Escapes: 1
Signed-off-by: Ruiyang Wang <ruiyang@anthropic.com>
Addresses review feedback from HuaHuaY: Instead of clearing the buffer in TranslateMinimalBatch for empty batches, use a WriteAndClearBuffer() helper that writes and clears the buffer in all write paths. This is cleaner because: - Every write follows the same pattern (write -> clear) - Easier to reason about for future write stages - The invariant is explicit: buffer is always clean after flush
Co-authored-by: Gang Wu <ustcwg@gmail.com>
… tests - Rename WriteAndClearBuffer to FlushToSink (shorter, clearer intent) - Consolidate Python tests into a single parameterized test with 3 cases: empty batch at beginning, middle, and end
902ca15 to
3cc81a2
Compare
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 0d0e068. There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
Fixes #36889
When writing CSV from a table where the first batch is empty, the header gets written twice:
What changes are included in this PR?
The bug happens because:
data_buffer_and flushed duringCSVWriterImplinitializationTranslateMinimalBatchreturns early without modifyingdata_buffer_data_buffer_which still contains stale contentThe fix introduces a
WriteAndClearBuffer()helper that writes the buffer to sink and clears it. This helper is used in all write paths:WriteHeader()WriteRecordBatch()WriteTable()This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.
Are these changes tested?
Yes. Added C++ tests in
writer_test.ccand Python tests intest_csv.py:Are there any user-facing changes?
No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.